As a passionate back-end developer with a focus on business innovation, I am deeply interested in connecting the latest technology trends with practical business applications.
My passion for experimentation and finding new commercial opportunities is matched by my ability to create reliable and professional back-end systems.
Language: English

Pavel Král

My presentation, titled "Large Language Models Across Languages," aims to explore the fascinating world of large language models (LLMs) and their varying efficiency across languages.
My talk is grounded in extensive research publicly available on GitHub, providing a robust and transparent foundation for the insights shared.
https://github.com/pavelkraleu/llm-language-tokens/tree/main
Key Points of the Talk:
* Token Economy Across Languages: An analysis of which languages require the most tokens to express the same content. This segment will dive into the tokenization process of LLMs and compare major languages on their token consumption for equivalent texts (a minimal token-counting sketch follows this list).
* Cross-Lingual Embeddings Effectiveness: Evaluating how effectively LLMs' embeddings capture and express information across different languages. This part will focus on the quality of embeddings in bridging language gaps and maintaining semantic integrity.
* Reasoning Abilities in Various Languages: Investigating the reasoning capabilities of LLMs when operating in multiple languages. The focus will be on comparing the models' performance across a spectrum of languages.
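To make the token-economy point concrete, here is a minimal sketch, not the benchmark code from the repository above, that counts how many tokens the same sentence costs in a few languages. The sample sentences and the choice of the cl100k_base encoding are illustrative assumptions.

```python
# Minimal sketch: compare token counts for the same sentence across languages.
# The sample sentences and the cl100k_base encoding (used by GPT-3.5/GPT-4)
# are illustrative assumptions, not the talk's actual benchmark setup.
import tiktoken

samples = {
    "English": "Large language models behave differently across languages.",
    "Czech": "Velké jazykové modely se v různých jazycích chovají odlišně.",
    "German": "Große Sprachmodelle verhalten sich in verschiedenen Sprachen unterschiedlich.",
}

encoding = tiktoken.get_encoding("cl100k_base")

for language, sentence in samples.items():
    token_count = len(encoding.encode(sentence))
    # More tokens for the same meaning means higher API cost
    # and a smaller effective context window for that language.
    print(f"{language}: {token_count} tokens")
```

Languages with writing systems or morphology that the tokenizer was not optimized for typically need noticeably more tokens for the same content, which is exactly the gap the talk quantifies.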
Language: English

Pavel Král

In this workshop, we will get hands-on experience building a simple RAG (Retrieval-Augmented Generation) tool in Python. We will use the OpenAI API to calculate embeddings and integrate with GPT.
We will learn about the following topics:
- Tokenization basics: What tokens are and why they are important
- Embeddings: How they work and their importance in the context of machine learning
- Searching for similar documents: Techniques and methods for effective search
- Generating new content: How to generate answers grounded in the retrieved documents (see the end-to-end sketch after this list)
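As a preview of what we will build, below is a minimal end-to-end sketch of the RAG flow using the OpenAI Python SDK. The model names (text-embedding-3-small, gpt-4o-mini) and the toy documents are illustrative assumptions; the workshop version will be more complete.

```python
# Minimal RAG sketch: embed a few documents, retrieve the one most similar
# to the question, and ask GPT to answer using it as context.
# Model names and the toy documents are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

documents = [
    "Tokens are the basic units that LLMs read and write.",
    "Embeddings map text to vectors so that similar texts end up close together.",
    "Retrieval-Augmented Generation answers questions using retrieved documents.",
]

def embed(texts):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

question = "What is RAG?"
question_vector = embed([question])[0]

# Cosine similarity between the question and every document.
scores = doc_vectors @ question_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(question_vector)
)
best_document = documents[int(np.argmax(scores))]

# Generate an answer grounded in the retrieved document.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using this context: {best_document}"},
        {"role": "user", "content": question},
    ],
)
print(completion.choices[0].message.content)
```

In the workshop we will walk through each step of this flow in turn: tokenization, embeddings, similarity search, and generation.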
We will also look at the specifics of LLMs (Large Language Models) and how they behave differently across languages. I have explored this topic in detail in my project, which is available on GitHub.
https://github.com/pavelkraleu/llm-language-tokens/tree/main